Abstract

We study the dynamics of on-line learning in large ($N \to \infty$) perceptrons, for the case
of training sets with a structural $O(N^0)$ bias of the input vectors, by deriving exact and
closed macroscopic dynamical laws using non-equilibrium statistical mechanical tools. In
sharp contrast to the more conventional theories developed for homogeneously distributed or
only weakly biased data, these laws are found to describe a non-trivial and persistently
non-deterministic macroscopic evolution, and a generalisation error which retains both stochastic
and sample-to-sample fluctuations, even for infinitely large networks. Furthermore, for the
standard error-correcting microscopic algorithms (such as the perceptron learning rule) one
obtains learning curves with distinct bias-induced phases. Our theoretical predictions find
excellent confirmation in numerical simulations.

PACS: 87.10.+e 
1 Introduction



Rosenblatt first introduced the perceptron and proved the famous perceptron convergence
theorem in 1962. It is an indicator of the richness of the perceptron as a dynamical system
that almost 40 years later it continues to yield fascinating results which have hitherto remained
hidden. Especially during the last decade, considerable progress has been made in understanding
the dynamics of learning in artificial neural networks through the application of the methods
of statistical mechanics. The dynamics of on-line learning in perceptrons has been analysed
intensively, but for the most part such studies [?] have been carried out in the idealised scenario
of so-called complete training sets (in which the number of training examples is large compared
with $N$, the number of degrees of freedom), and have also assumed a homogeneous input data
distribution. A recent review of work in this field is contained in [?]. A general theory of learning
in the context of restricted training sets (where the size of the training set is proportional to $N$)
is much more difficult, although an exact solution of the dynamical equations for the
more elementary problem of unbiased on-line Hebbian learning with restricted training sets and
noisy teachers has been found [?, ?]. Nevertheless, substantial progress has been made towards
a general theory of learning with restricted training sets and the reader may refer, for example,
to [6, ?, ?] for details.

In this paper we consider complete training sets, but we admit the possibility of a structural
bias of the input vectors. This is a significant issue since in real-world situations a training
sample will generally have a non-zero average; this is especially important in the case of
on-line learning, where examples are not available prior to learning, so that one cannot correct
for any bias prior to processing. This in itself would be sufficient motivation for the present
study. However, it turns out that the introduction of structurally biased input data leads to
qualitative (rather than only quantitative) modifications of the actual learning curves observed
in numerical simulations and the mathematical theories required for their description. Various
authors [10, 11] have studied so-called clustered examples, in which examples are drawn from
two Gaussian distributions situated close to each other, with an input bias of order $N^{-1/2}$ (i.e. in
magnitude similar to finite-size effects). Learning with input bias has also been considered in
the context of linear networks [12]; the linear theory was then used to construct an approximation
for a class of non-linear models, and it was shown that on-line learning is more robust to input
bias and out-performs batch learning when such bias is present.

Here we consider a situation which is more natural and less restrictive than the one considered
in [10, 11], and which does not require the linearity of [12]: we study the familiar (non-linear)
perceptron, with the perceptron learning rule and with a structural, i.e. $O(N^0)$, bias in the input
data. Using $\mathbf{B}$, $\mathbf{J}$ and $\mathbf{A}$ to denote the teacher weights, the student weights and the bias vector
(precise definitions follow), we develop our theory in terms of three macroscopic observables: the
standard observables $Q = \mathbf{J}^2$ and $R = \mathbf{J}\cdot\mathbf{B}$, and a new observable $S = \mathbf{J}\cdot\mathbf{A}$ (the overlap between
student weights and bias vector). In contrast to the dynamics of the bias-free case, we find
that in the presence of an $O(N^0)$ input bias the system passes through three phases, characterised
by different scaling of typical times and of macroscopic observables. This could already have
been anticipated on the basis of numerical simulations, see e.g. figure 1. We obtain a closed
system of equations in which the evolution of $\{Q, R\}$ is deterministic in the limit $N \to \infty$, as in
the bias-free case, but where $S$ is (generally) a stochastic variable, whose conditional probability
distribution $P_t(S|Q,R)$ becomes non-trivial. Phase I is a short phase, in which the system
reduces the alignment of the student weight vector $\mathbf{J}$ relative to the bias vector $\mathbf{A}$. During
phase I, the observable $S$ is deterministic, is rapidly driven towards zero, and no learning takes
place. Before the state $S = 0$ is reached, however, the system enters phase II, a very short
phase in which $S$ evolves stochastically to a quasi-stationary probability distribution (which we
calculate) and in which both $Q$ and $R$ are frozen. In phase III, where most of the learning takes
place, the $S$ distribution is modified by a non-negligible random walk element, which generates a
diffusion term in the equation controlling the evolution of $P_t(S|Q,R)$, whereas $Q$ and $R$ satisfy
coupled differential equations which involve averages over $P_t(S|Q,R)$. The stochastic nature
of $S$ is reflected in the fact that the generalisation error also exhibits fluctuations (see figure
1). The (exact) equations describing phase III cannot be further simplified, but we introduce
an approximation yielding more tractable equations, which still have the merit of
reducing to the more familiar equations when no bias is present. Moreover, they are found to
be in excellent agreement with the results obtained from numerical simulations. Compared to
the unbiased case, having a finite bias is found to change the pre-factor in the asymptotic power
law for the time-dependence of the generalisation error, but not the exponent. A
preliminary and more intuitive presentation of some of the present results can be found in [?].

Figure 1: Evolution of the generalisation error $E_g$ as measured in a single simulation experiment
of the perceptron rule, with $N = 1000$, learning rate $\eta = 1$ and bias $a$ = [illegible], following initial
conditions $Q(0) = 10$, $R(0) = 0$ and $S(0) = \sqrt{N}$ (see the text for details). The inset, magnifying
the early transients, shows the phases I and II. Clearly, no learning takes place in phase I.



2 Definitions 

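We study on-line learning in a student perceptron $S: \{-1,1\}^N \to \{-1,1\}$, which learns a task
defined by a teacher perceptron $T: \{-1,1\}^N \to \{-1,1\}$ whose fixed weight vector is $\mathbf{B} \in \mathbb{R}^N$.
Teacher and student output are given by the familiar recipes

$$T(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{B}\cdot\boldsymbol{\xi}], \qquad S(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{J}\cdot\boldsymbol{\xi}].$$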
We assume that $\mathbf{B}$ is normalised such that $\mathbf{B}^2 = 1$, the components being drawn randomly with
mean zero and standard deviation of order $N^{-1/2}$, and statistically independent of the input
data. In order to model the bias in the input sample we assume that for $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_N) \in \{-1,1\}^N$
all $\xi_i$ are independent, with $\langle \xi_i \rangle = a$, so that the probability of drawing $\boldsymbol{\xi}$ is given by

$$p(\boldsymbol{\xi}) = \prod_i \tfrac{1}{2}[1 + a\xi_i]. \qquad (1)$$

We define $\xi_i = a + v_i$, such that the (independent) $v_i$ have mean zero and variance $\sigma^2 = 1 - a^2$,
and the short-hand $\mathbf{A} = a(1, \ldots, 1)$ (i.e. a vector with all $N$ entries equal to $a$, to be referred to
as the 'bias vector'). The teacher-bias overlap $\mathbf{B}\cdot\mathbf{A}$ is now a random parameter which is $O(1)$,
since $\langle (\mathbf{B}\cdot\mathbf{A})^2 \rangle = a^2 \sum_{ij} \langle B_i B_j \rangle = a^2$, and whose distribution will be Gaussian for $N \to \infty$, with
mean 0 and standard deviation $a$.

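The student perceptron $S$ is trained according to an on-line learning rule of the form
$\mathbf{J}_{m+1} = \mathbf{J}_m + \Delta\mathbf{J}_m$, where at each iteration step $m$ an input vector $\boldsymbol{\xi}_m$ is drawn independently
according to (1), and where

$$\Delta\mathbf{J}_m = \frac{\eta}{N}\,\boldsymbol{\xi}_m\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}_m)\,\mathcal{F}[|\mathbf{J}_m|;\ \mathbf{J}_m\cdot\boldsymbol{\xi}_m,\ \mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}_m)].$$

For Hebbian learning, for instance, we have

$$\mathcal{F}[J; u, T] = 1: \qquad \Delta\mathbf{J}_m = \frac{\eta}{N}\,\boldsymbol{\xi}_m\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}_m),$$

whilst the familiar perceptron learning rule is defined by

$$\mathcal{F}[J; u, T] = \theta[-uT]: \qquad \Delta\mathbf{J}_m = \frac{\eta}{2N}\,\boldsymbol{\xi}_m\,[\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}_m) - \mathrm{sgn}(\mathbf{J}_m\cdot\boldsymbol{\xi}_m)]. \qquad (2)$$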
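The microscopic process defined above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the code used for the figures in this paper: the function names, the Gaussian initialisation of the student, and the default parameter values are all hypothetical choices made for the example.

```python
import numpy as np

def sample_biased_input(rng, n, a):
    """Draw xi in {-1,+1}^n with independent components of mean a, as in (1)."""
    return np.where(rng.random(n) < 0.5 * (1.0 + a), 1.0, -1.0)

def perceptron_step(J, B, xi, eta):
    """One on-line update of the perceptron rule (2).

    The increment vanishes whenever student and teacher agree on xi.
    """
    n = J.size
    return J + (eta / (2.0 * n)) * xi * (np.sign(B @ xi) - np.sign(J @ xi))

def observables(J, B, a):
    """Return (Q, R, S) = (J.J, J.B, J.A) with A = a * (1, ..., 1)."""
    A = np.full(J.size, a)
    return J @ J, J @ B, J @ A

def run(n=200, a=0.5, eta=1.0, q0=10.0, steps=2000, seed=0):
    """Simulate the rule and record (Q, R, S) after every update.

    Assumption for this sketch: teacher and initial student are random
    Gaussian directions, rescaled so that B.B = 1 and J.J = q0.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(n)
    B /= np.linalg.norm(B)
    J = rng.standard_normal(n)
    J *= np.sqrt(q0) / np.linalg.norm(J)
    history = []
    for _ in range(steps):
        xi = sample_biased_input(rng, n, a)
        J = perceptron_step(J, B, xi, eta)
        history.append(observables(J, B, a))
    return np.array(history)
```

Calling `run()` returns an array with one row `(Q, R, S)` per iteration, which can be plotted against `m / n` to inspect the three observables on the time scale used in the text.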

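We will derive, from the microscopic stochastic process for the weight vector $\mathbf{J}$, a macroscopic
dynamical theory in terms of the familiar observables $Q = \mathbf{J}^2$ and $R = \mathbf{J}\cdot\mathbf{B}$, as well as (in
order to obtain closure) a new observable $S = \mathbf{J}\cdot\mathbf{A}$ measuring the overlap between the vector
$\mathbf{J}$ and the bias vector. The teacher and student output can then be written in the form

$$S(\boldsymbol{\xi}) = \mathrm{sgn}[\lambda_1 + x], \qquad T(\boldsymbol{\xi}) = \mathrm{sgn}[\lambda_2 + y] \qquad \text{with} \quad \lambda_1 = \hat{\mathbf{J}}\cdot\mathbf{A}, \quad \lambda_2 = \mathbf{B}\cdot\mathbf{A},$$

with $\hat{\mathbf{J}} = \mathbf{J}/|\mathbf{J}|$, and where the local fields $\{x, y, z\}$ are defined by $x = \hat{\mathbf{J}}\cdot\mathbf{v}$, $y = \mathbf{B}\cdot\mathbf{v}$ and
$z = \hat{\mathbf{A}}\cdot\mathbf{v}$, with $\hat{\mathbf{A}} = \mathbf{A}/|\mathbf{A}|$ (the latter field $z$ will also enter our calculation in due course). Note that $\lambda_1 = S/\sqrt{Q}$.
For large $N$, the three fields $\{x, y, z\}$ are zero-average Gaussian random variables, each with
variance $\sigma^2 = 1 - a^2$, and with correlation coefficients given by

$$\langle xy \rangle = \omega\sigma^2, \qquad \langle xz \rangle = \sigma^2\lambda_1/|\mathbf{A}|, \qquad \langle yz \rangle = \sigma^2\lambda_2/|\mathbf{A}|. \qquad (3)$$

We note that equation (3) implies that $z$ will be independent of $(x, y)$ for large $N$, so that

$$p(x,y,z) = [\sigma\sqrt{2\pi}]^{-1} e^{-z^2/2\sigma^2}\, p(x,y), \qquad p(x,y) = [2\pi\sigma^2\sqrt{1-\omega^2}]^{-1} e^{-\frac{1}{2}[x^2 - 2\omega xy + y^2]/\sigma^2(1-\omega^2)} \qquad (4)$$

with $\omega = \hat{\mathbf{J}}\cdot\mathbf{B} = R/\sqrt{Q}$. It will turn out that most of the averages to appear in this paper,
involving (4) (to be written as $\langle \cdots \rangle$), may be expressed in terms of the function $K(x) = \mathrm{erf}(x/\sqrt{2})$.
The generalisation error $E_g = \langle \theta[-(\mathbf{J}\cdot\boldsymbol{\xi})(\mathbf{B}\cdot\boldsymbol{\xi})] \rangle$, for example, can be written as

$$E_g = \int dx\,dy\ p(x,y)\,\theta[-(\lambda_1+x)(\lambda_2+y)] = I_1(\lambda_1, -\lambda_2, \omega) + I_1(-\lambda_1, \lambda_2, \omega) \qquad (5)$$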
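where



with the Gaussian measure $Dy = (2\pi)^{-1/2} e^{-y^2/2}\, dy$ (see the appendix for details). This then gives

$$E_g = \frac{1}{2} - \frac{1}{2}\int_{-\lambda_2/\sigma}^{\infty} Dy\ K\Big(\frac{\lambda_1/\sigma + \omega y}{\sqrt{1-\omega^2}}\Big) + \frac{1}{2}\int_{-\infty}^{-\lambda_2/\sigma} Dy\ K\Big(\frac{\lambda_1/\sigma + \omega y}{\sqrt{1-\omega^2}}\Big) \qquad (6)$$

Note that, due to the identity $\int_0^\infty Dy\ K(\omega y/\sqrt{1-\omega^2}) = \frac{1}{2} - \frac{1}{\pi}\arccos\omega$, formula (6) reduces, as it
should, to the well known expression $E_g = \pi^{-1}\arccos\omega$ in the case where the input bias is zero
(i.e. for $a \to 0$).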
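The generalisation error can also be evaluated numerically straight from its definition, as the probability that $(\lambda_1 + x)(\lambda_2 + y) < 0$ under the Gaussian field statistics (4). The sketch below conditions on the teacher field and integrates with a trapezoidal rule, splitting the integral at the point where the teacher output changes sign. The function names, grid size and cut-off `y_max` are assumptions made for the example, not part of the paper.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def K(x):
    """K(x) = erf(x / sqrt(2)), applied elementwise."""
    return _erf(np.asarray(x, dtype=float) / math.sqrt(2.0))

def _gauss_integral(f, lo, hi, n=2001):
    """Trapezoidal estimate of the integral of f(y) exp(-y^2/2)/sqrt(2 pi) over [lo, hi]."""
    y = np.linspace(lo, hi, n)
    g = f(y) * np.exp(-0.5 * y**2) / math.sqrt(2.0 * math.pi)
    dy = (hi - lo) / (n - 1)
    return dy * (g.sum() - 0.5 * (g[0] + g[-1]))

def generalisation_error(lam1, lam2, omega, sigma, y_max=8.0):
    """Probability that student and teacher disagree; requires |omega| < 1.

    Here y is the teacher field in units of sigma. Given y, the student
    field is Gaussian with mean omega*y and variance 1 - omega^2, so the
    student outputs +1 with probability (1 + k(y)) / 2.
    """
    c = math.sqrt(1.0 - omega**2)
    k = lambda y: K((lam1 / sigma + omega * y) / c)
    # The teacher output is +1 for y > cut and -1 for y < cut.
    cut = min(max(-lam2 / sigma, -y_max), y_max)
    upper = _gauss_integral(k, cut, y_max)
    lower = _gauss_integral(k, -y_max, cut)
    return 0.5 - 0.5 * upper + 0.5 * lower
```

For zero bias (`lam1 = lam2 = 0`, `sigma = 1`) the result can be compared with the closed form `math.acos(omega) / math.pi` quoted in the text.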



3 From Microscopic to Macroscopic Laws 

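We now consider the dynamics of the macroscopic observables $\{Q, R, S\}$ in the limit of large
$N$. In the bias-free case, where for large $N$ the fluctuations in the macroscopic observables are
insignificant, this can be done in a direct and simple way. Here, for $a \neq 0$, the situation is
qualitatively different, since (as will turn out) the fluctuations in $S$ will no longer vanish, and
their distribution will have a strong impact on the macroscopic laws. In order to provide a
setting for our theory we briefly review a well known procedure which enables us to pass
from a discrete to a continuous time description. We suppose that at time $t$ the probability
that the perceptron has undergone precisely $m$ updates is given by the Poisson distribution
$\pi_m(t) = \frac{1}{m!}(Nt)^m e^{-Nt}$. For large $N$ this will give us $t = m/N + O(N^{-1/2})$, the usual real-valued
time unit, and the uncertainty as to where we are on the time axis vanishes as $N \to \infty$. It is
not hard to show that the probability density $p_t(\mathbf{J})$ of finding the vector $\mathbf{J}$ at time $t$ satisfies

$$\frac{d}{dt}\, p_t(\mathbf{J}) = N \int d\mathbf{J}'\ \Big\{ \langle \delta[\mathbf{J} - \mathbf{J}' - \Delta\mathbf{J}] \rangle_{\boldsymbol{\xi}} - \delta[\mathbf{J} - \mathbf{J}'] \Big\}\, p_t(\mathbf{J}')$$

where, for the perceptron learning rule (2), the single-step modification $\Delta\mathbf{J}$ is given by

$$\Delta\mathbf{J} = \frac{\eta}{2N}\,\boldsymbol{\xi}\,[\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}) - \mathrm{sgn}(\mathbf{J}\cdot\boldsymbol{\xi})]$$

and where $\langle \cdots \rangle_{\boldsymbol{\xi}}$ denotes the average over all questions $\boldsymbol{\xi}$ in the training set $\{-1,1\}^N$.
The macroscopic observables $\Omega = (Q, R, S)$, in turn, have the probability density
$P_t(\Omega) = \int d\mathbf{J}\ p_t(\mathbf{J})\,\delta[\Omega - \Omega(\mathbf{J})]$, which satisfies the macroscopic stochastic equation

$$\frac{d}{dt}\, P_t(\Omega) = \int d\Omega'\ W_t[\Omega; \Omega']\, P_t(\Omega')$$

where

$$W_t[\Omega; \Omega'] = N \Big\langle \langle \delta[\Omega - \Omega(\mathbf{J} + \Delta\mathbf{J})] \rangle_{\boldsymbol{\xi}} - \delta[\Omega - \Omega(\mathbf{J})] \Big\rangle_{\Omega'; t}$$

with the so-called sub-shell (or conditional) average $\langle \cdots \rangle_{\Omega; t}$ defined as

$$\langle f(\mathbf{J}) \rangle_{\Omega; t} = \frac{\int d\mathbf{J}\ p_t(\mathbf{J})\,\delta[\Omega - \Omega(\mathbf{J})]\, f(\mathbf{J})}{\int d\mathbf{J}\ p_t(\mathbf{J})\,\delta[\Omega - \Omega(\mathbf{J})]}.$$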
It is possible to make various assumptions regarding the scaling behaviour of our observables at
time $t = 0$, but once this has been specified the scaling at subsequent times is determined by
the dynamics. We make the natural assumption that $Q(0) = O(1)$ so that, in accordance with
our assumptions regarding the statistics of $\mathbf{B}$, we have $R(0) = O(N^{-1/2})$. We suppose that
$S(0) = O(N^{1/2})$, the maximum permitted by the Schwarz inequality.

In this context it is worth remarking that in the idealised case of zero bias, Hebbian learning
is known to out-perform the perceptron learning rule; but in the more realistic situation of even
moderately biased data the Hebbian rule fails miserably. For example, if we assume that $S(0)$
is $O(1)$, and that $Q$ and $R$ are initially $O(1)$, it follows from the learning rule (or from the methods
which we apply below to the perceptron learning rule) that in the initial evolution of the Hebbian
system $dS/d\tau = \eta a^2 K(\lambda_2/\sigma)$, where $\tau = Nt$, so that $S$ rapidly diverges and no learning takes
place; the student vector $\mathbf{J}$ cannot break away from its alignment to the bias vector. We shall
show, however, that the perceptron has no problem coping with extreme initial conditions such
as $S(0) = O(N^{1/2})$, and that in due course effective learning occurs. The Hebbian example
also serves to show that, even if we were to choose the weaker initial scaling $S(0) = O(N^0)$,
then, depending on the specific choice we make for the learning rule, the order parameter $S$ might
well be driven towards $S = O(N^{1/2})$ states.

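A systematic exploration of the possible scaling scenarios reveals the following.¹ For the
perceptron learning rule and for the initial scaling conditions as specified above, the only
self-consistent solution of the macroscopic equations is one describing a situation where the system
passes through three phases {I, II, III} defined by time scales $t = \{\tau N^{-1/2}, \tau N^{-1}, \tau\}$, in which our
observables are $O(1)$ quantities in all three phases, with the exception of $S$ which is $O(N^{1/2})$
in phase I. We will write $S = \tilde{S} N^{1/2}$ in Phase I, with $\tilde{S} = O(N^0)$, and formulate our Phase I
equations in terms of $\tilde{S}$ rather than $S$. The number of iterations $m$ is related to the original
time $t$ by $m = Nt$ so that the number of iterations up to time $\tau$, in each of the three phases,
is given by $m = \{\tau N^{1/2}, \tau N^0, \tau N\}$. We incorporate these scaling properties into our equations
in each of the three phases, by working henceforth only with $O(N^0)$ time units $\tau$ and $O(N^0)$
observables $\Omega$, which satisfy

$$\frac{d}{d\tau}\, P_\tau(\Omega) = \int d\Omega'\ W_\tau[\Omega; \Omega']\, P_\tau(\Omega') \qquad (7)$$

with

$$W_\tau[\Omega; \Omega'] = \Gamma_{\rm I,II,III} \Big\langle \langle \delta[\Omega - \Omega(\mathbf{J} + \Delta\mathbf{J})] \rangle_{\boldsymbol{\xi}} - \delta[\Omega - \Omega(\mathbf{J})] \Big\rangle_{\Omega'; \tau} \qquad (8)$$

and $\Gamma_{\rm I} = N^{1/2}$, $\Gamma_{\rm II} = N^0$, $\Gamma_{\rm III} = N$. In a subsequent stage it will be convenient to write
$\Delta\mathbf{J} = \mathbf{k} + \mathbf{k}'$, where

$$\mathbf{k} = \frac{\eta}{2N}\,\mathbf{A}\,[\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}) - \mathrm{sgn}(\mathbf{J}\cdot\boldsymbol{\xi})], \qquad \mathbf{k}' = \frac{\eta}{2N}\,\mathbf{v}\,[\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}) - \mathrm{sgn}(\mathbf{J}\cdot\boldsymbol{\xi})] \qquad (9)$$

so that

$$\Delta\mathbf{J}\cdot\mathbf{A} = \frac{1}{2}\eta a^2\,[\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)] + \frac{1}{2}\eta a \frac{z}{\sqrt{N}}\,[\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)] \qquad (10)$$

¹ For brevity we will in this paper only describe the resulting self-consistent solution, which is indeed perfectly
consistent with the observations in numerical simulations such as in figure 1.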
We are now in a position to discuss the dynamics in each of the three phases, in which different
scaling laws apply.



4 Phase I: Elimination of Bias-Induced Activation 

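In Phase I we define the $O(N^0)$ observables $\Omega = (\tilde{S}, Q, R) = (\mathbf{J}\cdot\mathbf{A}/\sqrt{N}, Q, R)$ and $\Gamma_{\rm I} = \sqrt{N}$.
Upon expanding the exponential $e^{-i\hat{\Omega}\cdot\Omega(\mathbf{J}+\Delta\mathbf{J})}$ in powers of $\Delta\mathbf{J}$ we obtain from equation (8)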



where 



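A straightforward calculation using equation (10) and the two averages $\langle \mathrm{sgn}(\lambda_2 + y) \rangle = K(\lambda_2/\sigma)$
and $\langle \mathrm{sgn}(\lambda_1 + x) \rangle = K(\lambda_1/\sigma) = \mathrm{sgn}(\tilde{S})$ (which is valid for large $N$ in phase I) now gives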

/. . A = lirja^[K(^) - sgn{S)]Cli + ir?^[K(^)-sgn(S)]A2. 
L J I 2 cr a 

We can now apply equation (7) to compute the time derivative of the probability density $P_\tau(\Omega)$.
Note that the sub-shell average $\langle \cdots \rangle_{\Omega'; \tau}$ involves an integration over all $\mathbf{J}$ for which $\Omega(\mathbf{J}) = \Omega'$
(in a distributional sense), so in calculating the relevant integrals we may effectively replace $\Omega(\mathbf{J})$
by $\Omega'$ at appropriate stages. For example, $\int d\hat{\Omega}\ \hat{\Omega}_j\, e^{i\hat{\Omega}\cdot[\Omega - \Omega']} = i(2\pi)^3\, \partial_j'\, \delta[\Omega - \Omega']$, where $\partial_j'$
denotes differentiation with respect to $\Omega_j'$. We now find for $P_\tau(\tilde{S}, Q, R)$ a Liouville equation



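$$\frac{\partial}{\partial\tau} P_\tau(\tilde{S},Q,R) = -\frac{\partial}{\partial\tilde{S}} \Big\{ \frac{1}{2}\eta a^2 [K(\lambda_2/\sigma) - \mathrm{sgn}(\tilde{S})]\, P_\tau(\tilde{S},Q,R) \Big\} - \frac{\partial}{\partial Q} \Big\{ \eta\tilde{S}\, [K(\lambda_2/\sigma) - \mathrm{sgn}(\tilde{S})]\, P_\tau(\tilde{S},Q,R) \Big\}$$

with the deterministic solution $P_\tau(\tilde{S},Q,R) = \delta[\tilde{S} - \tilde{S}(\tau)]\, \delta[Q - Q(\tau)]\, \delta[R - R(\tau)]$, where the actual
deterministic trajectory $\{\tilde{S}(\tau), Q(\tau), R(\tau)\}$ is the solution of the coupled flow equations

$$\frac{d}{d\tau}\tilde{S} = \frac{1}{2}\eta a^2 [K(\lambda_2/\sigma) - \mathrm{sgn}(\tilde{S})], \qquad \frac{d}{d\tau}Q = \eta\tilde{S}\, [K(\lambda_2/\sigma) - \mathrm{sgn}(\tilde{S})], \qquad \frac{d}{d\tau}R = 0.$$

It follows that $\tilde{S}(\tau) = \tilde{S}(0) + \frac{1}{2}\eta a^2 \tau\, [K(\lambda_2/\sigma) - \mathrm{sgn}(\tilde{S})]$. We see that $\tilde{S}$ is driven to zero in times
$\tau = \tau_\pm$ (with $\pm$ referring to the cases $\tilde{S}_0 > 0$ and $\tilde{S}_0 < 0$, respectively), which are given by

$$\tau_\pm = \frac{2|\tilde{S}_0|}{\eta a^2\, (1 \mp K(\lambda_2/\sigma))}.$$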

Irrespective of the value of $\tilde{S}_0$, the system seeks to eliminate any strong alignment of the learning
vector $\mathbf{J}$ relative to the bias vector $\mathbf{A}$. This is clearly confirmed by numerical simulations. Our
equation for $Q$ also readily integrates to give $Q = Q_0 + [\tilde{S}^2 - \tilde{S}_0^2]/a^2$. We see that the length
$J = \sqrt{Q}$ of the student weight vector decreases and that $J \to [Q_0 - \tilde{S}_0^2/a^2]^{1/2}$ as $\tau \to \tau_\pm$.
Again, this is confirmed by numerical simulations. The equation $dR/d\tau = 0$ implies that $\omega J$ is
constant in Phase I. As can be clearly seen in figure 1, no learning takes place in this phase,
since expression (6) for $E_g$ reduces to $E_g = \frac{1}{2}[1 - \mathrm{sgn}(\tilde{S})\, K(\lambda_2/\sigma)]$ in the limit $|\lambda_1| \to \infty$ (note:
$\lambda_1 = S/J$). However, at times $\tau$ approaching $\tau_\pm$ it is no longer valid to argue that $S$ is $O(\sqrt{N})$;
it is now $O(N^0)$ and we enter the scaling regime of Phase II.
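The Phase I results are simple enough to evaluate directly. The sketch below computes the exit time $\tau_\pm$ and the value of $Q$ at the end of Phase I from the linear solution for the rescaled overlap and the integrated equation for $Q$. The function name and argument order are assumptions made for this example.

```python
import math

def K(x):
    """K(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def phase_one(s0, q0, lam2, a, eta):
    """Return (tau_exit, q_exit) for the Phase I flow.

    s0 is the rescaled initial overlap S(0)/sqrt(N), assumed non-zero,
    and q0 = Q(0). The rescaled overlap moves at the constant rate
    0.5 * eta * a^2 * [K(lam2/sigma) - sgn(s0)] until it reaches zero,
    while Q follows Q = q0 + (s^2 - s0^2) / a^2.
    """
    sigma = math.sqrt(1.0 - a * a)
    drift = 0.5 * eta * a * a * (K(lam2 / sigma) - math.copysign(1.0, s0))
    tau_exit = -s0 / drift          # time at which the rescaled overlap hits zero
    q_exit = q0 - s0 * s0 / (a * a)  # value of Q when the overlap has vanished
    return tau_exit, q_exit
```

With `s0 = 1`, `q0 = 10`, `a = 0.5`, `eta = 1` and `lam2 = 0` this gives an exit time of 8 and a final `Q` of 6. Note that the Schwarz inequality requires `q0 >= s0**2 / a**2`, so `q_exit` is never negative for admissible initial conditions.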



5 Phase II: Transition to Error Correction 

As shown in the previous section, $S$ is an $O(N^0)$ quantity in phase II, and it is also clear that
$\{Q, R\}$ are $O(N^0)$ at the start of phase II. In phase II (and, as we will see, also in phase III)
we have to consider the observables $\Omega = (S, \Phi)$, with $\Phi = (Q, R)$; the reason for this slight
departure from our phase I terminology will soon become clear. We can now express equation
(8) as



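Here

$$\Phi_\mu(\mathbf{J} + \Delta\mathbf{J}) = \Phi_\mu(\mathbf{J}) + \sum_i \Delta J_i \frac{\partial \Phi_\mu}{\partial J_i} + \frac{1}{2} \sum_{ij} \Delta J_i \Delta J_j \frac{\partial^2 \Phi_\mu}{\partial J_i \partial J_j}$$

(this expansion is exact, since $\{Q, R\}$ are quadratic and linear functions, respectively).
Substituting and expanding the exponential gives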



n' 



where 



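Note: whereas it is valid to expand $e^{-i\hat{\Phi}\cdot\Phi(\mathbf{J}+\Delta\mathbf{J})}$ in the manner just described, we cannot treat
$e^{-i\hat{S}S(\mathbf{J}+\Delta\mathbf{J})}$ in the same way, since $\sum_i \Delta J_i\, (\partial S/\partial J_i) = \mathbf{A}\cdot\Delta\mathbf{J} = O(N^0)$ in phases II and III.
Equations (7,11) form the basis for our study of Phases II and III.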

The time scale $\tau$ in Phase II is related to $t$ via $t = \tau N^{-1}$, so that $\Gamma_{\rm II} = N^0$, but although
this phase is of short duration it has an important role as regards the stochastic evolution of
the bias overlap parameter $S$. It is straightforward to show that the third term in equation (11)
makes no contribution in the limit of large $N$. Moreover, in the very short phase II we may
approximate $\Delta\mathbf{J}\cdot\mathbf{A}$ by $\mathbf{k}\cdot\mathbf{A}$, see (10). Referring to the details and notation in the appendix we
have

$$\langle e^{-i\hat{S}\,\mathbf{k}\cdot\mathbf{A}} \rangle = 1 - E_g + e^{i\hat{S}\eta a^2} I_1(-\lambda_1, \lambda_2, \omega) + e^{-i\hat{S}\eta a^2} I_1(\lambda_1, -\lambda_2, \omega), \qquad (12)$$

and we then find that in phase II

$$W_\tau[\Omega; \Omega'] = I_1(-\lambda_1, \lambda_2, \omega)\, \delta[S - S' + \eta a^2]\, \delta[\Phi - \Phi'] + I_1(\lambda_1, -\lambda_2, \omega)\, \delta[S - S' - \eta a^2]\, \delta[\Phi - \Phi'] - E_g\, \delta[\Omega - \Omega'].$$

Figure 2: Histogram of values for $\lambda_1 = S/J$ as measured in a single simulation experiment with
$N = 400{,}000$, $\eta = 1$, and $a$ = [illegible], during the interval $t \in [0, 0.07]$. Here phase I is absent, by virtue
of the choice $S(0) = 0$, and $\lambda_2 = 0.287$. The stars indicate the predicted occurrence probabilities
as calculated from (13). On the short time-scale of observation the observed distribution for $\lambda_1$
is truly discrete: no values of $\lambda_1$ were found in between the centres of the histogram bars.

Upon substituting this into (7) and repeating the arguments used for phase I, we find that $Q$ and $R$
remain constant in phase II, whilst the conditional distribution $P_\tau(S|Q,R)$ satisfies

$$\frac{d}{d\tau} P_\tau(S|Q,R) = I_1(\lambda_1(S^-), -\lambda_2, \omega)\, P_\tau(S^-|Q,R) + I_1(-\lambda_1(S^+), \lambda_2, \omega)\, P_\tau(S^+|Q,R) - E_g(S,Q,R)\, P_\tau(S|Q,R)$$

where $S^\pm = S \pm \eta a^2$. The distribution equilibrates, on the relevant time-scale, to a stationary
distribution $P(S|Q,R)$ given as the solution of

$$E_g(S,Q,R)\, P(S|Q,R) = I_1(\lambda_1(S^-), -\lambda_2, \omega)\, P(S^-|Q,R) + I_1(-\lambda_1(S^+), \lambda_2, \omega)\, P(S^+|Q,R).$$

Using relation (5) we find that this equilibrium condition can be written as $A(S) + B(S) =
A(S^+) + B(S^-)$, where $A(S) = I_1(-\lambda_1(S), \lambda_2, \omega)\, P(S|Q,R)$ and $B(S) = I_1(\lambda_1(S), -\lambda_2, \omega)\, P(S|Q,R)$.
One can easily show by taking Fourier transforms that it is satisfied by $B(S) = A(S^+)$, the
correctness of which is evident by substitution. In this phase the permissible values of $S$
are those which differ from some initial value $S(0)$ by an integral multiple of $\eta a^2$. Upon
writing the allowed values of $S$ as $S_n = S(0) + n\eta a^2$, we immediately obtain $P(S|Q,R) =
\sum_{n=-\infty}^{\infty} w(S_n|Q,R)\, \delta[S - S_n]$, where

$$w(S_{n+1}|Q,R) = \frac{I_1(\lambda_1(n), -\lambda_2, \omega)}{I_1(-\lambda_1(n+1), \lambda_2, \omega)}\ w(S_n|Q,R), \qquad (13)$$
with $I_1$ as given in section 2. Equation (13) fully determines the quasi-stationary distribution
$P(S|Q,R)$. Comparison with numerical simulations shows very satisfactory agreement, see e.g.
figure 2. The above picture is also in line with our intuition, since in a single step the change in
$S$ is given by

$$\Delta S = \Delta\mathbf{J}\cdot\mathbf{A} = \frac{1}{2}\eta a^2\,[\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)] + \frac{1}{2}\eta a \frac{z}{\sqrt{N}}\,[\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)].$$

Provided we can neglect the $N^{-1/2}$ term in this expression, which is true on the time scale of
phase II, we see that in a single update $\Delta S \in \{0, \pm\eta a^2\}$. However, if the $N^{-1/2}$ term could
be neglected indefinitely this would imply that, far into the future, the system would retain a
memory of its initial conditions. In fact the term $\frac{1}{2}\eta a z\,[\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)]/\sqrt{N}$ represents
a random walk superposed on the quasi-stationary distribution found for $S$ in phase II.
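The quasi-stationary distribution of phase II is that of a one-dimensional jump process on the lattice $S_n = S(0) + n\eta a^2$, so it can be built up numerically from the balance between up-jumps out of $S_n$ and down-jumps out of $S_{n+1}$. The sketch below does this under explicit assumptions: the two jump probabilities are obtained by direct quadrature over the Gaussian field statistics (teacher $+1$ with student $-1$ raises $S$, teacher $-1$ with student $+1$ lowers it) rather than through the integrals $I_1$ of the appendix, the lattice is truncated at `n_max` sites on either side, and all function names are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def jump_probabilities(lam1, lam2, omega, sigma, n=2001, y_max=8.0):
    """Return (p_up, p_down): chances that one update changes S by +/- eta*a^2."""
    c = math.sqrt(1.0 - omega**2)
    cut = min(max(-lam2 / sigma, -y_max), y_max)  # teacher is +1 for y > cut

    def integral(sign, lo, hi):
        # Integrates P(student output = sign | y) against the unit Gaussian.
        y = np.linspace(lo, hi, n)
        k = _erf((lam1 / sigma + omega * y) / (c * math.sqrt(2.0)))
        g = 0.5 * (1.0 + sign * k) * np.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)
        return (hi - lo) / (n - 1) * (g.sum() - 0.5 * (g[0] + g[-1]))

    p_up = integral(-1.0, cut, y_max)     # teacher +1, student -1
    p_down = integral(+1.0, -y_max, cut)  # teacher -1, student +1
    return p_up, p_down

def stationary_weights(s0, q, omega, lam2, a, eta, n_max=40):
    """Weights w(S_n) on S_n = s0 + n*eta*a^2 for |n| <= n_max, normalised to 1."""
    sigma = math.sqrt(1.0 - a * a)
    j = math.sqrt(q)
    s = s0 + np.arange(-n_max, n_max + 1) * eta * a * a
    probs = [jump_probabilities(si / j, lam2, omega, sigma) for si in s]
    log_w = np.zeros(len(s))
    for i in range(len(s) - 1):
        up = max(probs[i][0], 1e-300)        # guard against log(0)
        down = max(probs[i + 1][1], 1e-300)
        # Balance condition: w(S_n) * p_up(S_n) = w(S_{n+1}) * p_down(S_{n+1}).
        log_w[i + 1] = log_w[i] + math.log(up) - math.log(down)
    w = np.exp(log_w - log_w.max())
    return s, w / w.sum()
```

Dividing the returned lattice points by $\sqrt{Q}$ gives the corresponding values of $\lambda_1$, which is the quantity histogrammed in figure 2.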

6 Phase III: Error Correction 

As we enter phase III, where $\Gamma_{\rm III} = N$, the above 'random walk' term will come to have a
significant role after about $N$ iterations², leading to a modified probability distribution which
contains a diffusion term: $S_n \to S_n + s(t)$. The walk is given by

$$s(t) = \frac{\eta a}{2\sqrt{N}} \sum_{\mu=1}^{Nt} z(\mu)\, [\mathrm{sgn}(\lambda_2 + y(\mu)) - \mathrm{sgn}(\lambda_1(\mu) + x(\mu))]$$

in an obvious notation, where the fields $z(\mu)$ are, as we have seen earlier, independent of $(x, y)$.
The random walk addition $s(t)$ has mean zero, and variance given by

$$\langle s^2(t) \rangle = \frac{\eta^2 a^2 \sigma^2}{2N} \sum_{\mu=1}^{Nt} \langle [1 - \mathrm{sgn}(\lambda_2 + y(\mu))\, \mathrm{sgn}(\lambda_1(\mu) + x(\mu))] \rangle = t\,(\eta a \sigma)^2 \langle E_g \rangle \qquad (14)$$

where $\langle E_g \rangle$ is to be interpreted as a time average of $E_g$ over phase III, up to time $t$.
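The variance (14) follows in three short steps from the single-step increment of the walk. The derivation below is a worked sketch using only the field statistics stated in section 2 (namely that $z$ has zero mean, variance $\sigma^2$, and is independent of $(x,y)$); the symbol $\delta s$ for a single increment is introduced here for convenience.

```latex
% Single-step increment of the walk (the N^{-1/2} term of Delta S):
\delta s = \frac{\eta a}{2\sqrt{N}}\, z\, [\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)]

% z is independent of (x,y), with <z> = 0 and <z^2> = sigma^2, so
\langle \delta s \rangle = 0, \qquad
\langle (\delta s)^2 \rangle
  = \frac{\eta^2 a^2 \sigma^2}{4N}\,
    \langle [\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)]^2 \rangle
  = \frac{\eta^2 a^2 \sigma^2}{2N}\,
    \langle 1 - \mathrm{sgn}(\lambda_2+y)\,\mathrm{sgn}(\lambda_1+x) \rangle

% The product of signs is -1 on an error and +1 otherwise, hence
\langle 1 - \mathrm{sgn}(\lambda_2+y)\,\mathrm{sgn}(\lambda_1+x) \rangle = 2 E_g,
\qquad
\langle (\delta s)^2 \rangle = \frac{\eta^2 a^2 \sigma^2}{N}\, E_g

% Increments at different steps are uncorrelated, so over Nt steps
\langle s^2(t) \rangle = Nt \cdot \frac{\eta^2 a^2 \sigma^2}{N}\, \langle E_g \rangle
                       = t\, (\eta a \sigma)^2\, \langle E_g \rangle
```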

² We are grateful to Peter Sollich for pointing this out.

Figure 3: Evolution of the order parameters $J = |\mathbf{J}|$ (upper) and $\tilde{S} = S/\sqrt{N} = \mathbf{J}\cdot\mathbf{A}/\sqrt{N}$
(lower), for $a$ = [illegible] and $\eta = 1$. Markers indicate simulation results (for $N = 1000$), the solid line is
the theoretical curve obtained by numerical solution of equations (15,16,17). Initial conditions:
$Q(0) = 10$, $R(0) = 0$ and $\tilde{S}(0) = 1$. On the time-scale $t = \mu/N$ only phase III is visible. The
inset shows a magnification of the initial stage of the process, where phase I can be observed.

In order to extract the macroscopic laws in phase III we will now have to analyse this diffusion
effect carefully, starting from equation (11). The details of this analysis are given in the appendix,
where we show that for large $N$ the macroscopic distribution in phase III will again be of
the form $P_\tau(S,Q,R) = P_\tau(S|Q,R)\, \delta[Q - Q(\tau)]\, \delta[R - R(\tau)]$, but now with the deterministic values
$\{Q(\tau), R(\tau)\}$ given as the solution of the coupled equations

$$\frac{d}{d\tau} Q = \eta\sqrt{Q} \int dS\ (K_1 + L_1 + M_1)\, P_\tau(S|Q,R) + \frac{1}{2}\eta^2 \int dS\ (K_3 + L_3 + M_3)\, P_\tau(S|Q,R) \qquad (15)$$

$$\frac{d}{d\tau} R = \frac{1}{2}\eta \int dS\ (K_2 + L_2 + M_2)\, P_\tau(S|Q,R) \qquad (16)$$

The factors $\{K_i, L_i, M_i\}$, defined in the appendix, are functions of $S$ (via $\lambda_1$) and of $\{Q, R\}$.
The origin and meaning of these two equations can be appreciated more clearly by writing them
in the following, somewhat more appealing, form (without as yet specifying the learning rule):

$$\frac{d}{d\tau} Q = 2\eta J \int dS\ P_\tau(S|Q,R)\ \langle (\lambda_1 + x)\, \mathrm{sgn}(\lambda_2 + y)\, \mathcal{F}[\sqrt{Q}, \lambda_1 + x, \mathrm{sgn}(\lambda_2 + y)] \rangle + \eta^2 \int dS\ P_\tau(S|Q,R)\ \langle \mathcal{F}^2[\sqrt{Q}, \lambda_1 + x, \mathrm{sgn}(\lambda_2 + y)] \rangle$$

$$\frac{d}{d\tau} R = \eta \int dS\ P_\tau(S|Q,R)\ \langle |\lambda_2 + y|\, \mathcal{F}[\sqrt{Q}, \lambda_1 + x, \mathrm{sgn}(\lambda_2 + y)] \rangle$$

(see [13] for details). Although equations (15,16) are superficially similar to the equations which
we derived in phase I, we now have a situation in which functions of $S$ are weighted with respect
to the probability distribution $P_\tau(S|Q,R)$, which satisfies a partial differential equation derived
in the appendix by integration over $Q$ and $R$, namely

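$$\frac{1}{N} \frac{d}{d\tau} P_\tau(S|Q,R) = I_1(-\lambda_1(S^+), \lambda_2, \omega)\, P_\tau(S^+|Q,R) + I_1(\lambda_1(S^-), -\lambda_2, \omega)\, P_\tau(S^-|Q,R) - E_g(S,Q,R)\, P_\tau(S|Q,R)$$
$$\qquad + \frac{1}{2N} \eta^2 a^2 \sigma^2 \Big\{ \frac{\partial^2}{\partial S^2} [I_1(-\lambda_1(S^+), \lambda_2, \omega)\, P_\tau(S^+|Q,R)] + \frac{\partial^2}{\partial S^2} [I_1(\lambda_1(S^-), -\lambda_2, \omega)\, P_\tau(S^-|Q,R)] \Big\} \qquad (17)$$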



Equations (15,16,17), together with the definitions of the short-hands $\{K_i, L_i, M_i\}$ as given in
the appendix, provide an exact and closed set of equations for the macroscopic dynamics in phase
III, in terms of the observables $\{S, Q, R\}$. In the large $N$ limit, $Q$ and $R$ satisfy deterministic
equations, as in conventional no-bias theories, but $S$ remains stochastic throughout phase III.
Furthermore, the persistent appearance of the factor $\lambda_2$ (which depends on the actual realisation
of the teacher weights) induces sample-to-sample fluctuations. An example of the result of
solving the coupled equations (15,16,17) numerically (via a numerical realisation, i.e. Monte
Carlo, of the conditional stochastic process (17) for $S$) is shown in figure 3, and compared with
numerical simulations of the underlying microscopic perceptron learning process. The agreement
between theory and experiment is quite satisfactory.



7 Asymptotics of the Generalisation Error 

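A full numerical study of our equations (15,16,17) would be difficult, but these equations undergo
a great simplification, permitting further analysis, if we make the approximation $P_\tau(S|Q,R) =
\delta[S - \langle S \rangle]$, and assume that $\lambda_1(\langle S \rangle) = \lambda_2$; numerical simulations confirm the validity of the
replacement of $\lambda_1$ by $\lambda_2$ on average in phase III. In this approximation equations (15,16) become

$$\frac{d}{d\tau} Q = \eta\sqrt{Q}\,(K_1 + L_1 + M_1) + \frac{1}{2}\eta^2 (K_3 + L_3 + M_3), \qquad \frac{d}{d\tau} R = \frac{1}{2}\eta\,(K_2 + L_2 + M_2).$$

Note that $K_1 + L_1 + M_1 = \lambda_1[(A_1 + B_1 + C_1) - (A_2 + B_2 + C_2)] + (A_3 + B_3 + C_3) - (A_4 + B_4 + C_4)$.
Referring to the appendix for the relevant expressions for $\{A_i, B_i, C_i\}$ in terms of the integrals
$I_1(\lambda_1, \lambda_2, \omega)$ and $I_2(\lambda_1, \lambda_2, \omega)$, and using the identity $K(a) = \int Dy\ K((a - \omega y)/\sqrt{1 - \omega^2})$, we
find that in the approximation $\lambda_1 = \lambda_2$ the following identities hold:

$$A_1 + B_1 + C_1 = K(\lambda_2/\sigma), \qquad A_2 + B_2 + C_2 = K(\lambda_2/\sigma),$$

$$A_3 + B_3 + C_3 = \sqrt{\frac{2}{\pi}}\, \omega\sigma\, e^{-\lambda_2^2/2\sigma^2}, \qquad A_4 + B_4 + C_4 = \sqrt{\frac{2}{\pi}}\, \sigma\, e^{-\lambda_2^2/2\sigma^2}, \qquad K_1 + L_1 + M_1 = -\sqrt{\frac{2}{\pi}}\, \sigma(1 - \omega)\, e^{-\lambda_2^2/2\sigma^2},$$

$$K_2 + L_2 + M_2 = (A_5 + B_5 + C_5) - (A_6 + B_6 + C_6) = \sqrt{\frac{2}{\pi}}\, \sigma(1 - \omega)\, e^{-\lambda_2^2/2\sigma^2},$$

and equations (15,16) therefore become (upon rewriting the equation for $Q$ in terms of $J = \sqrt{Q}$):

$$\frac{d}{d\tau} J = -\frac{\eta\sigma}{\sqrt{2\pi}}\,(1 - \omega)\, e^{-\lambda_2^2/2\sigma^2} + \frac{\eta^2}{2J} E_g, \qquad \frac{d}{d\tau} R = \frac{\eta\sigma}{\sqrt{2\pi}}\,(1 - \omega)\, e^{-\lambda_2^2/2\sigma^2} \qquad (18)$$

The corresponding equation for $\omega = R/J$ is

$$\frac{d}{d\tau} \omega = \frac{\eta\sigma}{J\sqrt{2\pi}}\,(1 - \omega^2)\, e^{-\lambda_2^2/2\sigma^2} - \frac{\omega\eta^2}{2J^2} E_g \qquad (19)$$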
which is to be solved in combination with (18). Numerical solution of these equations is found
to be in very good agreement with the results of numerical simulations, even for finite times;
however, it is relevant to consider what basis exists for making the approximation $\lambda_1 = \lambda_2$,
other than the fact that it works. We have already observed that the probability distribution
for $S$ in phase III is a random walk superposed on the underlying discrete distribution which
emerged in phase II. Equation (14) indicates that the random walk, reflected in the diffusion
terms in equation (17), could in principle lead to a large variance for $S$, were this random walk
not coupled to the underlying discrete distribution via equation (17). The discrete distribution
and the random walk, however, are found to interact in such a way that the fluctuations actually
tend to zero in the limit $\tau \to \infty$; this is confirmed by the results of numerical simulations which
show that the fluctuations in $\lambda_1 = S/J$ decrease with time and that on average $\lambda_1$ tends to $\lambda_2$,
see figure 4. In a single step the average change in $S$ is equal to

$$\frac{1}{2}\eta a^2 \langle [\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)] \rangle + \frac{\eta a}{2\sqrt{N}} \langle z\, [\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)] \rangle = \frac{1}{2}\eta a^2 [K(\lambda_2/\sigma) - K(\lambda_1/\sigma)]$$

so, as the fluctuations in $S$ diminish, we do indeed expect that $\lambda_1$ will tend to $\lambda_2$.

Figure 4: Distribution $p(\lambda_1)$ of $\lambda_1 = S/J$, as measured during a single simulation run, with
$N = 1000$, $a$ = [illegible] and $\eta = 1$, over the time intervals [0,100] (dotted curve), [950,1050] (dashed
curve) and [9900,10000] (full curve). One observes that the fluctuations in $\lambda_1$ are reduced to
zero, as time progresses.
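The approximate phase III equations for $J$ and $\omega$ can be integrated with a simple forward Euler scheme. The sketch below is an illustration under stated assumptions, not the numerical procedure used for the figures: it takes the equations in the form $dJ/d\tau = -g(1-\omega) + \eta^2 E_g/2J$ and $d\omega/d\tau = g(1-\omega^2)/J - \omega\eta^2 E_g/2J^2$ with $g = \eta\sigma e^{-\lambda_2^2/2\sigma^2}/\sqrt{2\pi}$ (our reading of (18) and (19)), evaluates $E_g$ at $\lambda_1 = \lambda_2$ by quadrature, and uses hypothetical function names and step sizes. The fixed quadrature grid stops resolving the integrand once $\omega$ is extremely close to 1, so it is meant for moderate times only.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def eg_equal_fields(beta, omega, n=801, y_max=8.0):
    """E_g for lambda_1 = lambda_2 = beta * sigma, by quadrature over the teacher field."""
    c = math.sqrt(max(1.0 - omega**2, 1e-300))
    cut = min(max(-beta, -y_max), y_max)  # teacher output is +1 for y > cut

    def part(sign, lo, hi):
        # Integrates P(student output = sign | y) against the unit Gaussian.
        y = np.linspace(lo, hi, n)
        k = _erf((beta + omega * y) / (c * math.sqrt(2.0)))
        g = 0.5 * (1.0 + sign * k) * np.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)
        return (hi - lo) / (n - 1) * (g.sum() - 0.5 * (g[0] + g[-1]))

    # Errors: teacher +1 with student -1, plus teacher -1 with student +1.
    return part(-1.0, cut, y_max) + part(+1.0, -y_max, cut)

def integrate(j0, omega0, lam2, a, eta, tau_end, dtau=0.01):
    """Forward Euler integration; returns a list of (tau, J, omega, E_g) tuples."""
    sigma = math.sqrt(1.0 - a * a)
    beta = lam2 / sigma
    g = eta * sigma * math.exp(-0.5 * beta**2) / math.sqrt(2.0 * math.pi)
    tau, j, omega = 0.0, j0, omega0
    traj = []
    while True:
        eg = eg_equal_fields(beta, omega)
        traj.append((tau, j, omega, eg))
        if tau >= tau_end:
            break
        dj = -g * (1.0 - omega) + eta**2 * eg / (2.0 * j)
        domega = g * (1.0 - omega**2) / j - omega * eta**2 * eg / (2.0 * j * j)
        j += dtau * dj
        omega = min(omega + dtau * domega, 1.0 - 1e-9)  # keep omega strictly below 1
        tau += dtau
    return traj
```

Plotting the last column of the returned trajectory against the first gives an approximate learning curve that can be compared with a direct simulation of the perceptron rule.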

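We will now use the coupled equations (18,19) to derive an asymptotic expression for the
generalisation error $E_g$. Differentiation of (6) with respect to $\omega$ gives

$$\frac{dE_g}{d\omega} = -\frac{1}{\pi\sqrt{1 - \omega^2}}\ e^{-\beta^2 (1 - \omega)/(1 - \omega^2)}$$

with the constant $\beta = \lambda_2/\sigma$. Changing the variable to $\omega = \cos\theta$, and expanding for $\theta \to 0$ gives

$$E_g = \frac{\theta}{\pi}\, e^{-\beta^2/2} \Big[ 1 - \frac{\beta^2\theta^2}{24} + O(\theta^4) \Big]. \qquad (20)$$

Equation (18) for $J$ and equation (19) for $\omega$ can now be written

$$\frac{d}{d\tau} J = -\frac{\eta\sigma}{\sqrt{2\pi}}\,(1 - \cos\theta)\, e^{-\beta^2/2} + \frac{\eta^2}{2J} E_g, \qquad -J \sin\theta\, \frac{d}{d\tau}\theta = \frac{\eta\sigma}{\sqrt{2\pi}}\, \sin^2\theta\ e^{-\beta^2/2} - \frac{\eta^2 \cos\theta}{2J} E_g.$$

Using the expansion $\tan\theta = \theta + \frac{1}{3}\theta^3 + O(\theta^5)$ we then expand our previous equations for the
evolution of $J$ and $\theta$, giving

$$\frac{d}{d\tau}\theta = e^{-\beta^2/2} \Big\{ -\frac{\eta\sigma}{J\sqrt{2\pi}} \Big[ \theta - \frac{\theta^3}{6} \Big] + \frac{\eta^2}{2\pi J^2} \big[ 1 - p\theta^2 \big] \Big\} + \ldots, \qquad \text{with} \quad p = \frac{\beta^2}{24} + \frac{1}{3}.$$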
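Upon making the asymptotic ansatz $J = A/\theta$, the equation for $J$ can now be expressed so as to
give a second equation for $\theta$. The two resulting equations for $d\theta/d\tau$ are

$$\frac{d}{d\tau}\theta = \Big[ \frac{\eta\sigma\theta^4}{2A\sqrt{2\pi}} - \frac{\eta^2\theta^4}{2\pi A^2} \Big]\, e^{-\beta^2/2} + O(\theta^6)$$

and

$$\frac{d}{d\tau}\theta = \Big[ -\frac{\eta\sigma\theta^2}{A\sqrt{2\pi}} + \frac{\eta^2\theta^2}{2\pi A^2} \Big]\, e^{-\beta^2/2} + O(\theta^4).$$

Consistency requires that $A$ be given by $A = \eta/\sigma\sqrt{2\pi}$. The asymptotic equation for $\theta$
subsequently becomes $d\theta/d\tau = -\frac{1}{2}\sigma^2\theta^4 e^{-\beta^2/2}$, from which we obtain the asymptotic power law
$\theta = k\tau^\alpha$, where $\alpha = -\frac{1}{3}$ and $k^3 = 2e^{\beta^2/2}/3\sigma^2$. Combining this, finally, with (20) we then obtain,
recalling that in phase III one simply has $\tau = m/N = t$:

$$E_g(t) = \rho(a)\, e^{-\lambda_2^2/3\sigma^2}\, t^{-1/3} \quad (t \to \infty), \qquad \rho(a) = \Big[ \frac{2}{3\pi^3 (1 - a^2)} \Big]^{1/3} \qquad (21)$$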



Note that the power of $t$ occurring in this expression is the same as the power which appears in
the asymptotic form of the generalisation error in the conventional no-bias theory; the coefficient
is different, but reduces to the familiar form in the case of zero bias, where $a = \lambda_2 = 0$
and $\sigma = 1$. Moreover, our prediction of the asymptotic form of $E_g$ is in excellent agreement with
the results of numerical simulations. This is evident from figure 5, where we show the observed
function $\rho(a)$, defined as $\rho(a) = \lim_{t \to \infty} E_g(t)\, t^{1/3} e^{\lambda_2^2/3\sigma^2}$, versus the theoretical prediction as given
in (21). Note that the dependence of (21) on the teacher-bias overlap $\lambda_2 = \mathbf{B}\cdot\mathbf{A}$ implies
sample-to-sample fluctuations.
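For quick comparisons with simulation data, the asymptotic law can be wrapped in two small functions. This sketch assumes our reading of (21), namely a pre-factor $[2/(3\pi^3(1-a^2))]^{1/3}$ multiplying $e^{-\lambda_2^2/3\sigma^2}\, t^{-1/3}$ with $\sigma^2 = 1 - a^2$; the function names are hypothetical.

```python
import math

def rho(a):
    """Pre-factor of the asymptotic law as a function of the bias a (|a| < 1)."""
    return (2.0 / (3.0 * math.pi**3 * (1.0 - a * a))) ** (1.0 / 3.0)

def asymptotic_eg(t, a, lam2):
    """Asymptotic generalisation error at time t > 0 for bias a and overlap lam2."""
    sigma2 = 1.0 - a * a
    return rho(a) * math.exp(-lam2**2 / (3.0 * sigma2)) * t ** (-1.0 / 3.0)
```

At zero bias, `rho(0.0)` equals `(2/3)**(1/3) / math.pi`, so the function reduces to the familiar no-bias coefficient mentioned in the text.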



8 Discussion 



We have studied analytically the dynamics of on-line learning in non-linear perceptrons, trained 
according to the perceptron rule, for the scenario of having structurally biased, i.e. $O(N^0)$, input 
data. The bias changes the learning process qualitatively, inducing three distinct phases (with 
different scaling properties) and persistent stochastic as well as sample-to-sample fluctuations 
in the generalisation error, even for $N \to \infty$. At a theoretical level, the need to introduce an 
extra order parameter $S$ (the projection of the student weight vector in the direction of the 
bias) which is neither deterministic nor self-averaging makes the analysis considerably more 
involved than that of the idealised bias-free case. In the third and final phase, in which most 
of the learning takes place, we have obtained a set of exact closed equations which involve the 
conditional probability density of $S$. However, because of their complicated nature, an exact 
analytic solution of these equations appears to be out of the question, as is also generally the 
case in the more familiar no-bias scenarios. Nevertheless we have found that an approximate 
(and much simpler) version of our equations yields results which are in excellent agreement with 
numerical simulations. We have shown that the asymptotic power law for the generalisation error is 
largely preserved, with the bias showing up only in the pre-factor. At various stages throughout 
our calculations we have compared the predictions of our macroscopic dynamic equations with 
the results of numerical simulations of the underlying (microscopic) learning process, which 
consistently showed excellent agreement. 

Figure 5: Comparison of $\rho(a)$ as found in simulations ($N = 1000$ and $\eta = 1$; see the main text 
for details of its definition), for various values of the teacher-bias overlap $\lambda_2 = B \cdot A$ (squares), 
with the theoretical prediction (21) (solid curve). 

Although in this paper we have confined ourselves to the perceptron learning rule, it is clear 
that our analysis is in no way restricted to this particular rule, and can be applied to other rules 
such as the AdaTron learning rule, where $\Delta J = \frac{1}{2}\eta\,\xi\,[\mathrm{sgn}(B\cdot\xi) - \mathrm{sgn}(J\cdot\xi)]\,|J\cdot\xi|$; one could even 
study optimal learning rates and optimal learning rules, generalising [|lj] to the case of having 
$a \neq 0$. Preliminary studies of the AdaTron learning rule with structurally biased data show, 
for instance, that the simple result (13), describing the phase II distribution in the case of the 
perceptron, is replaced by the integral equation 

/oo ^ 
dp G{p,\2) Pr{S+{\i+p)ria^J)\Q,R) 
- Ai 

POO ^ 

+ ric?J \ dp G{p-\2)Pr{S+{\i- p)r]a^ J)\Q,R). 
JAi 



where G is defined by 

G{X,\2 



l-K 



The discrete distribution which in the present paper we found for the perceptron in phase 
II no longer applies in the AdaTron case, and is replaced by a continuous distribution which 
satisfies the above integral equation. The analysis of the AdaTron in the case of biased data 
is more complicated than for the perceptron, as might have been expected from the nature of 
the AdaTron learning rule, but much of the work which we have presented for the perceptron 
can be carried through, and the results will be published in [15]. There is also scope for a 
more detailed mathematical investigation of the partial differential equation which we derived 
to describe the conditional probability distribution $P_\tau(S|Q,R)$ for the perceptron, but this is 
likely to be difficult, and beyond the scope of the present paper. 
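
To make the difference between the two microscopic rules discussed above concrete, the following Python sketch implements one update step of each. It is a minimal sketch under stated assumptions, not the code used for our simulations: the prefactor $\frac{1}{2}\eta$, the convention sgn(0) = +1, the helper names, and in particular the scaling of the input components in `biased_input` (components $\pm 1/\sqrt{n}$ with mean $a/\sqrt{n}$) are all illustrative choices.

```python
import random

def sgn(u):
    """Sign function with sgn(0) taken as +1 (a convention assumed here)."""
    return 1.0 if u >= 0 else -1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_step(J, B, xi, eta):
    """Error-correcting update: J moves along xi only if student and teacher disagree."""
    factor = 0.5 * eta * (sgn(dot(B, xi)) - sgn(dot(J, xi)))
    return [j + factor * x for j, x in zip(J, xi)]

def adatron_step(J, B, xi, eta):
    """Same as perceptron_step, but the step is weighted by |J . xi|."""
    local_field = dot(J, xi)
    factor = 0.5 * eta * (sgn(dot(B, xi)) - sgn(local_field)) * abs(local_field)
    return [j + factor * x for j, x in zip(J, xi)]

def biased_input(n, a, rng):
    """Draw n components equal to +-1/sqrt(n), each with mean a/sqrt(n).

    The 1/sqrt(n) normalisation is an assumption made for this sketch only.
    """
    scale = n ** -0.5
    return [scale if rng.random() < 0.5 * (1.0 + a) else -scale for _ in range(n)]

# Example: one update of each rule on a freshly drawn biased input.
rng = random.Random(0)
xi = biased_input(4, 0.3, rng)
J_after_perceptron = perceptron_step([-1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], xi, 1.0)
J_after_adatron = adatron_step([-1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], xi, 1.0)
```

Both rules leave the student unchanged whenever it already agrees with the teacher; they differ only in the size of the correction made after an error.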
A Integrals and Averages 

We recall that the function $K$ is defined by $K(x) = \mathrm{erf}(x/\sqrt{2})$. In terms of this definition note 
that $\int_r^\infty d\zeta\, e^{-\zeta^2/2} = \frac{1}{2}\sqrt{2\pi}\,[1 - K(r)]$. We now proceed to list various integrals which occur in 
our calculations, or are referred to in the text, and where appropriate outline a brief derivation. 
Recall that the joint distribution of $(x, y) = (\hat{J} \cdot v, B \cdot v)$ is given by 

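$$p(x,y) = \left[2\pi\sigma^2\sqrt{1-\omega^2}\right]^{-1} \exp\left[-\frac{x^2 - 2\omega x y + y^2}{2\sigma^2(1-\omega^2)}\right]$$
where $\sigma^2 = 1 - a^2$ and $\omega = B \cdot \hat{J}$. We then find that 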



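$$
\begin{aligned}
I_1(\lambda_1,\lambda_2,\omega) &= \int_{\lambda_1}^{\infty}\! dx \int_{\lambda_2}^{\infty}\! dy\; p(x,y)
= \int_{\lambda_2}^{\infty} \frac{dy}{2\pi\sigma}\, e^{-y^2/2\sigma^2} \int_{(\lambda_1-\omega y)/\sigma\sqrt{1-\omega^2}}^{\infty}\! d\zeta\; e^{-\zeta^2/2} \\
&= \frac{1}{4}\left[1 - K\!\left(\frac{\lambda_2}{\sigma}\right)\right] - \frac{1}{2}\int_{\lambda_2/\sigma}^{\infty}\! Dy\; K\!\left(\frac{\lambda_1 - \omega\sigma y}{\sigma\sqrt{1-\omega^2}}\right)
\end{aligned}
\qquad (22)
$$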
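
As a numerical sanity check, the following Python sketch evaluates $K$, the Gaussian tail identity quoted at the start of this appendix, and $I_1$ via the one-dimensional integral over $y$ obtained by carrying out the $x$-integration of $p(x,y)$. The helper names (`simpson`, `gaussian_tail`, `I1`), the truncation of the infinite range at a finite `cutoff`, and the use of Simpson's rule are our own illustrative assumptions.

```python
import math

def K(x):
    """K(x) = erf(x / sqrt(2)), as defined in the text."""
    return math.erf(x / math.sqrt(2.0))

def simpson(f, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi]; n must be even."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def gaussian_tail(r, cutoff=12.0):
    """Integral of exp(-z^2/2) from r to infinity, truncated at a finite upper limit."""
    return simpson(lambda z: math.exp(-0.5 * z * z), r, max(r, 0.0) + cutoff)

def I1(lam1, lam2, omega, sigma=1.0, cutoff=12.0):
    """I1 = P(x > lam1, y > lam2) for the bivariate Gaussian p(x, y)."""
    s = sigma * math.sqrt(1.0 - omega * omega)

    def integrand(y):
        # Gaussian weight in y times the conditional tail probability of x.
        weight = math.exp(-0.5 * (y / sigma) ** 2) / (2.0 * sigma * math.sqrt(2.0 * math.pi))
        return weight * (1.0 - K((lam1 - omega * y) / s))

    return simpson(integrand, lam2, max(lam2, 0.0) + cutoff * sigma)

tail_numeric = gaussian_tail(0.7)
tail_formula = 0.5 * math.sqrt(2.0 * math.pi) * (1.0 - K(0.7))
quadrant = I1(0.0, 0.0, 0.5)  # compare with 1/4 + arcsin(omega) / (2 pi)
```

For $\lambda_1 = \lambda_2 = 0$ the result can be compared with the standard quadrant probability $\frac{1}{4} + \arcsin(\omega)/2\pi$ of a bivariate Gaussian.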



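and similarly 

$$
\begin{aligned}
I_2(\lambda_1,\lambda_2,\omega) &= \int_{\lambda_1}^{\infty}\! dx\, x \int_{\lambda_2}^{\infty}\! dy\; p(x,y)
= \int_{\lambda_1}^{\infty}\! dx\, x\, \frac{e^{-x^2/2\sigma^2}}{2\sigma\sqrt{2\pi}} \left[1 - K\!\left(\frac{\lambda_2-\omega x}{\sigma\sqrt{1-\omega^2}}\right)\right] \\
&= \frac{\sigma}{2\sqrt{2\pi}}\, e^{-\lambda_1^2/2\sigma^2}\left[1 - K\!\left(\frac{\lambda_2-\omega\lambda_1}{\sigma\sqrt{1-\omega^2}}\right)\right]
+ \frac{\omega\sigma}{2\sqrt{2\pi}}\, e^{-\lambda_2^2/2\sigma^2}\left[1 - K\!\left(\frac{\lambda_1-\omega\lambda_2}{\sigma\sqrt{1-\omega^2}}\right)\right]
\end{aligned}
$$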
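
The closed form for $I_2$ follows from an integration by parts, and can be checked numerically. The Python sketch below compares it with a direct quadrature of the one-dimensional integral over $x$; the function names, the parameter values and the finite `cutoff` are illustrative assumptions.

```python
import math

def K(x):
    """K(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def simpson(f, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi]; n must be even."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def I2_closed(lam1, lam2, omega, sigma=1.0):
    """Closed form of I2 in terms of K and two Gaussian factors."""
    s = sigma * math.sqrt(1.0 - omega * omega)
    pref = sigma / (2.0 * math.sqrt(2.0 * math.pi))
    term1 = math.exp(-0.5 * (lam1 / sigma) ** 2) * (1.0 - K((lam2 - omega * lam1) / s))
    term2 = omega * math.exp(-0.5 * (lam2 / sigma) ** 2) * (1.0 - K((lam1 - omega * lam2) / s))
    return pref * (term1 + term2)

def I2_numeric(lam1, lam2, omega, sigma=1.0, cutoff=12.0):
    """Direct quadrature of the x-integral, after integrating p(x, y) over y > lam2."""
    s = sigma * math.sqrt(1.0 - omega * omega)

    def integrand(x):
        weight = x * math.exp(-0.5 * (x / sigma) ** 2) / (2.0 * sigma * math.sqrt(2.0 * math.pi))
        return weight * (1.0 - K((lam2 - omega * x) / s))

    return simpson(integrand, lam1, max(lam1, 0.0) + cutoff * sigma)

closed = I2_closed(0.3, -0.5, 0.4, 0.8)
numeric = I2_numeric(0.3, -0.5, 0.4, 0.8)
```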



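The following averages with respect to the distribution $p(x, y)$ are easily calculated: 

$$
\langle \mathrm{sgn}(\lambda_1+x)\rangle = K\!\left(\frac{\lambda_1}{\sigma}\right), \qquad
\langle x\,\mathrm{sgn}(\lambda_2+y)\rangle = \sqrt{\frac{2}{\pi}}\,\omega\sigma\, e^{-\lambda_2^2/2\sigma^2}, \qquad
\langle x\,\mathrm{sgn}(\lambda_1+x)\rangle = \sqrt{\frac{2}{\pi}}\,\sigma\, e^{-\lambda_1^2/2\sigma^2}
$$

$$
\langle \mathrm{sgn}(\lambda_1+x)\,\mathrm{sgn}(\lambda_2+y)\rangle = I_1(\lambda_1,\lambda_2,\omega) - I_1(-\lambda_1,\lambda_2,-\omega) - I_1(\lambda_1,-\lambda_2,-\omega) + I_1(-\lambda_1,-\lambda_2,\omega)
$$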



7T^ 



^2 



Xi—ujay 
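
The single-sign averages above reduce to one-dimensional Gaussian integrals and are easily verified numerically. The Python sketch below does so by splitting each integral at the point where the sign function jumps; the helper name `gaussian_average_with_sign`, the parameter values and the finite integration range are illustrative assumptions. The mixed average $\langle x\,\mathrm{sgn}(\lambda_2+y)\rangle$ is obtained from the conditional mean $E[x|y] = \omega y$, which holds for the joint distribution $p(x,y)$.

```python
import math

def K(x):
    """K(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def simpson(f, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi]; n must be even."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def gaussian_average_with_sign(g, lam, sigma, cutoff=12.0):
    """Average of g(x) * sgn(lam + x) over a zero-mean Gaussian x with variance sigma^2.

    The integral is split at x = -lam so that each piece has a smooth integrand.
    """
    def density(x):
        return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    reach = abs(lam) + cutoff * sigma
    upper = simpson(lambda x: g(x) * density(x), -lam, -lam + reach)
    lower = simpson(lambda x: g(x) * density(x), -lam - reach, -lam)
    return upper - lower

lam1, lam2, omega, sigma = 0.6, -0.3, 0.5, 0.9

mean_sign = gaussian_average_with_sign(lambda x: 1.0, lam1, sigma)
mean_x_sign = gaussian_average_with_sign(lambda x: x, lam1, sigma)
# <x sgn(lam2 + y)> via the conditional mean E[x | y] = omega * y.
mean_x_sign_y = omega * gaussian_average_with_sign(lambda y: y, lam2, sigma)
```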



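Finally, in studying phases II and III we require the following averages: 

$$
\begin{aligned}
\langle \mathrm{sgn}(\lambda_2+y)\, e^{-i\hat{S} k\cdot A}\rangle &= A_1 + B_1 e^{i\hat{S}\eta a^2} + C_1 e^{-i\hat{S}\eta a^2} \\
\langle \mathrm{sgn}(\lambda_1+x)\, e^{-i\hat{S} k\cdot A}\rangle &= A_2 + B_2 e^{i\hat{S}\eta a^2} + C_2 e^{-i\hat{S}\eta a^2} \\
\langle x\,\mathrm{sgn}(\lambda_2+y)\, e^{-i\hat{S} k\cdot A}\rangle &= A_3 + B_3 e^{i\hat{S}\eta a^2} + C_3 e^{-i\hat{S}\eta a^2} \\
\langle x\,\mathrm{sgn}(\lambda_1+x)\, e^{-i\hat{S} k\cdot A}\rangle &= A_4 + B_4 e^{i\hat{S}\eta a^2} + C_4 e^{-i\hat{S}\eta a^2} \\
\langle y\,\mathrm{sgn}(\lambda_2+y)\, e^{-i\hat{S} k\cdot A}\rangle &= A_5 + B_5 e^{i\hat{S}\eta a^2} + C_5 e^{-i\hat{S}\eta a^2} \\
\langle y\,\mathrm{sgn}(\lambda_1+x)\, e^{-i\hat{S} k\cdot A}\rangle &= A_6 + B_6 e^{i\hat{S}\eta a^2} + C_6 e^{-i\hat{S}\eta a^2} \\
\langle e^{-i\hat{S} k\cdot A}\rangle &= A_7 + B_7 e^{i\hat{S}\eta a^2} + C_7 e^{-i\hat{S}\eta a^2} \\
\langle \mathrm{sgn}(\lambda_1+x)\,\mathrm{sgn}(\lambda_2+y)\, e^{-i\hat{S} k\cdot A}\rangle &= A_8 + B_8 e^{i\hat{S}\eta a^2} + C_8 e^{-i\hat{S}\eta a^2}
\end{aligned}
$$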
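with 

$$
\begin{array}{lll}
A_1 = -[I_1(\lambda_1,\lambda_2,\omega) - I_1(-\lambda_1,-\lambda_2,\omega)], & B_1 = -I_1(-\lambda_1,\lambda_2,-\omega), & C_1 = I_1(\lambda_1,-\lambda_2,-\omega), \\
A_2 = -[I_1(\lambda_1,\lambda_2,\omega) - I_1(-\lambda_1,-\lambda_2,\omega)], & B_2 = I_1(-\lambda_1,\lambda_2,-\omega), & C_2 = -I_1(\lambda_1,-\lambda_2,-\omega), \\
A_3 = I_2(\lambda_1,\lambda_2,\omega) + I_2(-\lambda_1,-\lambda_2,\omega), & B_3 = -I_2(-\lambda_1,\lambda_2,-\omega), & C_3 = -I_2(\lambda_1,-\lambda_2,-\omega), \\
A_4 = I_2(\lambda_1,\lambda_2,\omega) + I_2(-\lambda_1,-\lambda_2,\omega), & B_4 = I_2(-\lambda_1,\lambda_2,-\omega), & C_4 = I_2(\lambda_1,-\lambda_2,-\omega), \\
A_5 = I_2(\lambda_2,\lambda_1,\omega) + I_2(-\lambda_2,-\lambda_1,\omega), & B_5 = I_2(\lambda_2,-\lambda_1,-\omega), & C_5 = I_2(-\lambda_2,\lambda_1,-\omega), \\
A_6 = I_2(\lambda_2,\lambda_1,\omega) + I_2(-\lambda_2,-\lambda_1,\omega), & B_6 = -I_2(\lambda_2,-\lambda_1,-\omega), & C_6 = -I_2(-\lambda_2,\lambda_1,-\omega), \\
A_7 = I_1(\lambda_1,\lambda_2,\omega) + I_1(-\lambda_1,-\lambda_2,\omega), & B_7 = I_1(-\lambda_1,\lambda_2,-\omega), & C_7 = I_1(\lambda_1,-\lambda_2,-\omega), \\
A_8 = I_1(\lambda_1,\lambda_2,\omega) + I_1(-\lambda_1,-\lambda_2,\omega), & B_8 = -I_1(-\lambda_1,\lambda_2,-\omega), & C_8 = -I_1(\lambda_1,-\lambda_2,-\omega).
\end{array}
$$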
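
Two simple consistency checks on the coefficients built from $I_1$ are that $A_7 + B_7 + C_7 = 1$ (the four sign regions of the $(x,y)$ plane exhaust all probability) and that $A_1 + B_1 + C_1$ reproduces $\langle \mathrm{sgn}(\lambda_2+y)\rangle = K(\lambda_2/\sigma)$. The Python sketch below evaluates both; the function `phase_coefficients`, the parameter values and the quadrature settings are illustrative assumptions, and only the coefficients with indices 1, 2, 7 and 8 are computed.

```python
import math

def K(x):
    """K(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def simpson(f, lo, hi, n=4000):
    """Composite Simpson rule on [lo, hi]; n must be even."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def I1(lam1, lam2, omega, sigma=1.0, cutoff=12.0):
    """I1 = P(x > lam1, y > lam2), as a one-dimensional integral over y."""
    s = sigma * math.sqrt(1.0 - omega * omega)

    def integrand(y):
        weight = math.exp(-0.5 * (y / sigma) ** 2) / (2.0 * sigma * math.sqrt(2.0 * math.pi))
        return weight * (1.0 - K((lam1 - omega * y) / s))

    return simpson(integrand, lam2, max(lam2, 0.0) + cutoff * sigma)

def phase_coefficients(lam1, lam2, omega, sigma=1.0):
    """Return {index: (A, B, C)} for the coefficients that involve I1 only."""
    pp = I1(lam1, lam2, omega, sigma)      # both signs negative
    mm = I1(-lam1, -lam2, omega, sigma)    # both signs positive
    mp = I1(-lam1, lam2, -omega, sigma)    # multiplies exp(+i S eta a^2)
    pm = I1(lam1, -lam2, -omega, sigma)    # multiplies exp(-i S eta a^2)
    return {
        1: (-(pp - mm), -mp, pm),
        2: (-(pp - mm), mp, -pm),
        7: (pp + mm, mp, pm),
        8: (pp + mm, -mp, -pm),
    }

lam1, lam2, omega, sigma = 0.4, -0.3, 0.6, 0.9
coeff = phase_coefficients(lam1, lam2, omega, sigma)
total_probability = sum(coeff[7])
teacher_sign_average = sum(coeff[1])
```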



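All these formulae may be established by elementary methods. For example, 

$$
\begin{aligned}
\langle x\,\mathrm{sgn}(\lambda_2+y)\, e^{-i\hat{S} k\cdot A}\rangle
&= \int\! dx\,dy\; x\, p(x,y)\,\mathrm{sgn}(\lambda_2+y)\, e^{-\frac{1}{2} i\hat{S}\eta a^2[\mathrm{sgn}(\lambda_2+y) - \mathrm{sgn}(\lambda_1+x)]} \\
&= \left[\int_{-\infty}^{-\lambda_1}\! dx\, x + \int_{-\lambda_1}^{\infty}\! dx\, x\right] \int_{-\lambda_2}^{\infty}\! dy\; e^{-\frac{1}{2} i\hat{S}\eta a^2[1 - \mathrm{sgn}(\lambda_1+x)]}\, p(x,y) \\
&\quad - \left[\int_{-\infty}^{-\lambda_1}\! dx\, x + \int_{-\lambda_1}^{\infty}\! dx\, x\right] \int_{\lambda_2}^{\infty}\! dy\; e^{\frac{1}{2} i\hat{S}\eta a^2[1 + \mathrm{sgn}(\lambda_1+x)]}\, p(x,-y) \\
&= \int_{\lambda_1}^{\infty}\! dx\, x \int_{\lambda_2}^{\infty}\! dy\; p(x,y) - \int_{-\lambda_1}^{\infty}\! dx\, x \int_{\lambda_2}^{\infty}\! dy\; e^{i\hat{S}\eta a^2} p(x,-y) \\
&\quad - \int_{\lambda_1}^{\infty}\! dx\, x \int_{-\lambda_2}^{\infty}\! dy\; e^{-i\hat{S}\eta a^2} p(x,-y) + \int_{-\lambda_1}^{\infty}\! dx\, x \int_{-\lambda_2}^{\infty}\! dy\; p(x,y) \\
&= I_2(\lambda_1,\lambda_2,\omega) - e^{i\hat{S}\eta a^2} I_2(-\lambda_1,\lambda_2,-\omega) - e^{-i\hat{S}\eta a^2} I_2(\lambda_1,-\lambda_2,-\omega) + I_2(-\lambda_1,-\lambda_2,\omega) \\
&= A_3 + B_3 e^{i\hat{S}\eta a^2} + C_3 e^{-i\hat{S}\eta a^2}
\end{aligned}
$$

as required. 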

B Analysis of Macroscopic Distribution in Phase III 

Here we give the details of our analysis of the macroscopic distribution $P_\tau(S, Q, R)$ in phase III, 
starting from equation (11). We note that, in phase III: 



I 2Vn 



{rjazSy 



[l-sgn(Ai+x)sgn(A2+2/)] + • 



The terms which we neglected are 0{N ^), since when performing averages over the training 
set the average of the term is zero. Equation (11) now yields 

Wr[n,n'] = N (e^"-l"-"('^)ke-^^^-^-l-^^^e-^^-^[l-sgn(Ai+x)sgn(A2+y)])^ 



^in-in-fiij)] 



(2vr) 



(23) 
where 



52$ 



dJi dJ. 



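We showed in appendix A that 

$$
\begin{aligned}
\int\! dx\,dy\; p(x,y)\,\mathrm{sgn}(\lambda_1+x)\,\mathrm{sgn}(\lambda_2+y)\, e^{-i\hat{S} k\cdot A}
&= I_1(\lambda_1,\lambda_2,\omega) - I_1(-\lambda_1,\lambda_2,-\omega)\, e^{i\hat{S}\eta a^2} \\
&\quad - I_1(\lambda_1,-\lambda_2,-\omega)\, e^{-i\hat{S}\eta a^2} + I_1(-\lambda_1,-\lambda_2,\omega)
\end{aligned}
\qquad (24)
$$

so that 

$$
\int\! dx\,dy\; p(x,y)\, e^{-i\hat{S} k\cdot A}\,[1 - \mathrm{sgn}(\lambda_1+x)\,\mathrm{sgn}(\lambda_2+y)] = 2\left[e^{i\hat{S}\eta a^2} I_1(-\lambda_1,\lambda_2,-\omega) + e^{-i\hat{S}\eta a^2} I_1(\lambda_1,-\lambda_2,-\omega)\right],
$$

by virtue of equation (12) and the fact that $I_1(\lambda_1,\lambda_2,\omega) + I_1(-\lambda_1,-\lambda_2,\omega) = 1 - E_g$. Bearing in 
mind the sub-shell average we may write 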

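$$\int\! d\hat{S}\; \hat{S}^2\, e^{i\hat{S}[S \pm \eta a^2 - S']} = -2\pi\, \frac{\partial^2}{\partial S^2}\, \delta[S \pm \eta a^2 - S']$$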



Upon combining equations ([7|, p^p^ ) we find that in phase III the joint probability density 
Pr{S,Q, R) satisfies 



d 



Pr{S,Q,R) = N 



dT 

-Eg{S,Q,R)PriS,Q,R) 



h{-Xi{S+), X2,^)Pr{S+, Q, R) + /i(Ai(5-) -A2,^)Pr(5-, Q, R) 



52 



h{-Xi{S'),\2,^)^5[S+rja'-S'] 



where s • • • [■ is given by equation (24), and hence 



+Ii(Ai(5')-A2,^)^<5[5-7?a2-5'] 



(2vr)= 



—Pr{S,Q,R) = N 



/i(-Ai(S+), A2,^)Pr(5+, Q, R) + h{Xi{S-)-\2,MPr{S-,Q. R) 



-Eg{S,Q,R)Pr{S,Q,R) 



+ -r] a a 



+ Q^[h{XiiS~),-\2,^)PriS- ,Q,R)] 



dfl' 



-h{-Xi{S+),X2,MPr{S+,Q,R)] 



(25) 



As regards the evaluation of 



we note that 



^E(^^^e"'^^'^)^^M = ^?J((Ai+x)[sgn(A2+y)-sgn(Ai+x)]e-^^^-^)l>i 



iij, 
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+^r/((A2 +x) [sgn(A2 +y) -sgn(Ai +x)]e-^^^-^)$2 

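in which 

$$
\begin{array}{ll}
K_1 = \lambda_1(A_1 - A_2) + (A_3 - A_4), & K_2 = \lambda_2(A_1 - A_2) + (A_5 - A_6), \\
L_1 = \lambda_1(B_1 - B_2) + (B_3 - B_4), & L_2 = \lambda_2(B_1 - B_2) + (B_5 - B_6), \\
M_1 = \lambda_1(C_1 - C_2) + (C_3 - C_4), & M_2 = \lambda_2(C_1 - C_2) + (C_5 - C_6)
\end{array}
$$

and $A_i$, $B_i$, $C_i$ are functions defined in appendix A and expressed in terms of the integrals 
$I_1(\lambda_1, \lambda_2, \omega)$ and $I_2(\lambda_1, \lambda_2, \omega)$. In a similar way we find that 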

^T.ik^k^-^^^~''^'^)^^f^ = r/'([l-sgn(Ai+x)sgn(A2+2/)]e^^^-^)|.i 



where $K_3 = A_7 - A_8$, $L_3 = B_7 - B_8$, and $M_3 = C_7 - C_8$. Note that the term 



makes no contribution to yyr[i^,r2'] in the limit of large N. Using equations (|2^ and (|25|) we 
can now carry out the remaining integrations using standard formulae from distribution theory, 
as described for earlier phases, and find that 



d_ 



N 



PriS,Q,R) = 

h{-Xi{S+), X2,^)Pr{S+, Q, R) + /i(Ai(5^), -A2, -u;)Pr{S-, Q, R) - Eg{S, Q, R)Pr{S, Q, R) 



I 2 2 2 
+-7] a a 



^[I,{-X,{S+),X2,^)Pr{S+,Q,R)] + ^[h{Xi{S-)-X2,^)Pr{S-,Q,R)] 



r]J[KiPr{S, Q, R)+LiPr{S\ Q, R)+MiPr{S-, Q, R)] 



_d 

"d~Q L 

+]^^''[K^P,{S, Q, R)+LMS\ Q, R)+MsPr{S-, Q, R)] 



_d_ 
'dR 



-V[K2Pr{S, Q, R) + L2PriS+, Q, R)+M2PriS-,Q, R)] 



(26) 



Integration over $S$ now gives, in combination with the relation $E_g = I_1(-\lambda_1,\lambda_2,-\omega) + I_1(\lambda_1,-\lambda_2,-\omega)$: 



PriQ,R) 



r]J jdS iKi + Li+Mi)PriS\Q,R) 



+^V^ JdS {K^+L^+M^)Pr{S\Q,R) 



~WQ,R) 



ir? jdS {K2+L2+M2)Pr{S\Q,R) 



which is a Liouville equation with solution $P_\tau(Q,R) = \delta[Q - Q(\tau)]\,\delta[R - R(\tau)]$, where the deterministic 
flow trajectories $(Q(\tau), R(\tau))$ are given as the solutions of ( p!5|JI^ ), as claimed. 